Scaling-up NLP Pipelines to Process Large Corpora of Clinical Notes.
Abstract
INTRODUCTION: This article is part of the Focus Theme of Methods of Information in Medicine on "Big Data and Analytics in Healthcare".

OBJECTIVES: This paper describes the scale-up efforts at the VA Salt Lake City Health Care System to process large corpora of clinical notes through a natural language processing (NLP) pipeline. The use case described is a current project focused on detecting the presence of an indwelling urinary catheter in hospitalized patients and subsequent catheter-associated urinary tract infections.

METHODS: An NLP algorithm using v3NLP was developed to detect the presence of an indwelling urinary catheter in hospitalized patients. The algorithm was tested on a small corpus of notes on patients for whom the presence or absence of a catheter was already known (reference standard). In planning for a scale-up, we estimated that the original algorithm would have taken 2.4 days to run on the larger corpus of notes for this project (550,000 notes), and 27 days for a corpus of 6 million records representative of a national sample of notes. We approached scaling up NLP pipelines through three techniques: pipeline replication via multi-threading, intra-annotator threading for tasks that can be further decomposed, and remote annotator services that enable annotator scale-out.

RESULTS: The scale-up reduced the average time to process a record from 206 milliseconds to 17 milliseconds, a 12-fold increase in performance, when applied to a corpus of 550,000 notes.

CONCLUSIONS: Purposely simplistic in nature, these scale-up efforts are the straightforward evolution from small-scale NLP processing to larger-scale extraction without incurring the complexities introduced by the underlying UIMA framework. These efforts represent generalizable and widely applicable techniques that will aid other computationally complex NLP pipelines that need to be scaled out for processing and analyzing big data.
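The first of the three techniques, pipeline replication via multi-threading, amounts to running independent copies of the per-note annotation step on a pool of worker threads. A minimal illustrative sketch follows — this is not the v3NLP/UIMA implementation, and `annotate` is a hypothetical stand-in for a pipeline's annotation logic:

```python
from concurrent.futures import ThreadPoolExecutor

def annotate(note):
    # Hypothetical stand-in for one pipeline instance's per-note
    # annotation step (e.g. detecting an indwelling-catheter mention).
    return {"note_id": note["id"],
            "catheter_mention": "catheter" in note["text"].lower()}

def process_corpus(notes, n_pipelines=8):
    # Pipeline replication: each worker thread processes notes
    # concurrently with its own call into the annotation step,
    # so total wall-clock time scales down with the pool size.
    with ThreadPoolExecutor(max_workers=n_pipelines) as pool:
        return list(pool.map(annotate, notes))

notes = [
    {"id": 1, "text": "Indwelling urinary catheter placed on admission."},
    {"id": 2, "text": "No urinary symptoms reported."},
]
results = process_corpus(notes)
```

The same pattern extends to the other two techniques: intra-annotator threading splits a single annotator's decomposable work across threads, and remote annotator services move annotators onto separate processes or hosts so they can be scaled out independently.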
Similar resources
Bluima: a UIMA-based NLP Toolkit for Neuroscience
This paper describes Bluima, a natural language processing (NLP) pipeline focusing on the extraction of neuroscientific content and based on the UIMA framework. Bluima builds upon models from biomedical NLP (BioNLP) like specialized tokenizers and lemmatizers. It adds further models and tools specific to neuroscience (e.g. named entity recognizer for neuron or brain region mentions) and provides...
v3NLP Framework: Tools to Build Applications for Extracting Concepts from Clinical Text
INTRODUCTION Substantial amounts of clinically significant information are contained only within the narrative of the clinical notes in electronic medical records. The v3NLP Framework is a set of "best-of-breed" functionalities developed to transform this information into structured data for use in quality improvement, research, population health surveillance, and decision support. BACKGROUND...
Assessing the Corpus Size vs. Similarity Trade-off for Word Embeddings in Clinical NLP
The proliferation of deep learning methods in natural language processing (NLP) and the large amounts of data they often require stands in stark contrast to the relatively data-poor clinical NLP domain. In particular, large text corpora are necessary to build high-quality word embeddings, yet often large corpora that are suitably representative of the target clinical data are unavailable. This ...
Large Scale Clinical Text Processing and Process Optimization
and Objective This tutorial outlines the benefits and challenges of processing large volumes of clinical text with natural language processing (NLP). As NLP becomes more available and is able to tackle more complex problems, the ability to scale to millions of clinical notes must be considered. The Department of Veterans Affairs (VA) has more than 2 billion clinical notes and has developed NLP libr...
Feasibility of pooling annotated corpora for clinical concept extraction
Availability of annotated corpora has facilitated application of machine learning algorithms to concept extraction from clinical notes. However, it is expensive to prepare annotated corpora in individual institutions, and pooling of annotated corpora from other institutions is a potential solution. In this paper we investigate whether pooling of corpora from two different sources, can improve p...
Journal: Methods of Information in Medicine
Volume 54, Issue 6
Pages: -
Publication date: 2015